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ABSTRACT 

Scientific and engineering applications often involve structured meshes. These meshes may be nested (for 
multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). 
In this paper, we present a combined runtime and compile-time approach for parallelizing these applications 
on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed 
and implemented a runtime library which can be used to port these applications on distributed memory 
machines. The library is currently implemented on several different systems. To further ease the task of 
application programmers, we have developed methods for integrating this runtime library with compilers for 
HPF-Iike parallel programming languages. We discuss how we have integrated this runtime library with the 
Fortran 90D compiler being developed at Syracuse University. We present experimental results to demonstrate 
the efficacy of our approach. We have experimented with a multiblock Navier-Stokes solver template and a 
multigrid code. Our experimental results show that our primitives have low runtime communication overheads. 
Further, the compiler parallelized codes perform within 20% of the code parallelized by manually inserting 
calls to the runtime library. 
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assume all responsibility for the contents of the paper. 




1 Introduction 


In recent years, distributed memory parallel machines have been widely recognized as the most likely means 
of achieving scalable high performance computing. However, there are two major reasons for their lack of 
popularity among developers of scientific and engineering applications. First, it is very difficult to parallelize 
application programs on these machines. Second, it is not easy to get good speed-ups and efficiency on commu- 
nication intensive applications. Current distributed memory machines have good communication bandwidths, 
but they also have high startup latencies which often result in high communication overheads. 

Recently there have been major efforts in developing programming language and compiler support for 
distributed memory machines. Based on the initial works of projects like Fortran D [13, 22] and Vienna For- 
tran [6,41] , the High Performance Fortran Forum has recently proposed the first version of High Performance 
Fortran (HPF) [12]. HPF allows programmer to specify the layout of distributed data and specify parallelism 
through operations on array sections and through parallel loops. Proposed HPF compilers are being designed 
to produce Single Program Multiple Data (SPMD) Fortran 77 (F77) code with message passing and/or runtime 
communication primitives. HPF offers the promise of significantly easing the task of programming distributed 
memory machines and making programs independent of a single machine architecture. 

Reducing communication costs is crucial in achieving good performance on applications [20, 21]. While cur- 
rent systems like the Fortran D project [22] and the Vienna Fortran Compilation system [6] have implemented 
a number of optimizations for reducing communication costs (like message blocking, collective communication, 
message coalescing and aggregation), these optimizations have been developed only in the context of regular 
problems (i.e. in code having only regular data access patterns). Special effort is required in developing 
compiler and runtime support for applications that do not necessarily have regular data access patterns. Our 
group has already developed compiler embedded runtime support for completely irregular computations (i.e. 
codes in which distributed arrays are accessed based on indirection arrays) [10, 11, 18]. 

One class of scientific and engineering applications involves structured meshes. These meshes may be 
nested (as in multigrid codes) or may be irregularly coupled (called Multiblock or Irregularly Coupled Regu- 
lar Mesh Problems) [9]. Multigrid is a common technique for accelerating the solution of partial-differential 
equations [5, 30]. Multigrid codes employ a number of meshes at different levels of resolution. The restriction 
and prolongation operations for shifting between different multigrid levels require moving regular array sec- 
tions [19] with non-unit strides. In multiblock problems, the data is divided into several interacting regions 
(called blocks or subdomains). There are computational phases in which regular computation is performed 
on each block independently. Boundary updates require communication between blocks, which is restricted 
to moving regular array sections (possibly including non-unit strides). 

Multiblock grids are frequently used for modeling geometrically complex objects which cannot be easily 
modeled using a single regular mesh [2, 3, 24, 29, 36]. In Figure 1, we show how the area around an 
aircraft wing has been modeled with a multiblock grid. Multiblock applications are used in important grand- 
challenge applications like air quality modeling [28, 32], computational fluid dynamics [40], structure and 
galaxy formation [25, 38], simulation of high performance aircrafts [1, 8, 31], large scale climate modeling [14], 
reservoir modeling for porous media [14], simulation of propulsion systems [14], computational combustion 
dynamics [14], geophysical databases [33], and land cover dynamics [23]. 

In this paper we present a combined runtime and compile-time approach for parallelizing this general 
class of applications on distributed memory machines. We present runtime support that we have designed and 
implemented for parallelizing these applications on distributed memory machines in an efficient, convenient and 
machine independent manner. We state the extensions required to the current version of HPF for handling 
block structured codes. We present methods by which the compilers for HPF style parallel programming 
languages can automatically generate calls to our runtime primitives. We discuss how we have integrated our 
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Figure 1: Block structured grid around a wing, showing an interface between blocks 

runtime primitives with the Fortran 90D compiler being developed at Syracuse University [4]. While the 
design of our runtime system was initially motivated by multigrid and multiblock applications, our primitives 
can also be used in many cases for parallelizing computations with regular data access patterns. 

We present experimental results to demonstrate the efficacy of our approach. We have experimented with 
one multiblock application [40] and one multigrid code [35]. We have measured the runtime overheads of our 
primitives. We have compared the performance of compiler parallelized multiblock and multigrid templates 
with those of the hand parallelized (i.e. parallelized by inserting calls to the runtime library by hand) versions. 
Our experimental results show that the primitives have low runtime communication overheads and the compiler 
parallelized codes performs within 20% of the codes parallelized by inserting calls to runtime library by hand. 
We discuss the optimizations that we have used to achieve this performance. 

Several other researchers have also developed runtime libraries or programming environments for multiblock 
applications. Baden [24] has developed a Lattice Programming Model (LPAR). This system, however, achieves 
only coarse grained parallelism since a single block can only be assigned to one processor. Quinlan [26] has 
developed P++, a set of C++ libraries for grid applications. While this library provides a convenient interface, 
the libraries do not optimize communication overheads. Our library, on the contrast, reduces communication 
costs by using message aggregation. 

The rest of this paper is organized as follows. In Section 2, we introduce the runtime library that we have 
developed. In Section 3, we discuss the regular section analysis required for efficiently generating schedules 
for regular section moves, one of the communication primitives in our library. In Section 4, we state the 
extensions required to the current version of HPF and discuss how the compiler recognizes the data access 
patterns which can be handled using the runtime primitives that we have developed. In this section we also 
discuss the compiler transformations for automatically inserting the calls to the runtime library routines. In 
Section 5, we present experimental results to study the communication overheads of our primitives and the 


2 






effectiveness of the compiler. We conclude in Section 6. 


2 Runtime Support 

In this section we present the details of the runtime support library that we have designed for parallelizing 
multiblock and multigrid codes on distributed memory machines. We discuss the nature of computation 
and communication in multiblock and multigrid codes and also describe how the library primitives facilitate 
parallelization of these applications. 

The set of runtime routines that we have developed is called the Multiblock Parti library [39]. In summary, 
these primitives allow an application programmer or a compiler to 

• Lay out distributed data in a flexible way, to enable good load balancing and minimize interprocessor 
communication, 

• Give high level specifications for performing data movement, and 

• Distribute the computation across the processors. 

We have designed the primitives so that communication overheads are significantly reduced (by using 
message aggregation). These primitives provide a machine-independent interface to the compiler writer and 
applications programmer. We view these primitives as forming a portion of a portable, compiler independent, 
runtime support library. 

This library is currently implemented on the Intel iPSC/860, the Thinking Machines’ CM-5 and the PVM 
message passing environment for network of workstations [15]. The design of the library is architecture inde- 
pendent and therefore it can be easily ported on any distributed memory parallel machine or any environment 
which supports message passing (e.g. Express). The library primitives can currently be invoked from Fortran 
or C programs. Programmers can port their Fortran or C programs on distributed memory machines by 
manually inserting calls to the library routines. The resulting program has Single Program Multiple Data 
(SPMD) model of parallelism. 

While the design of our runtime system was initially motivated by multigrid and multiblock applications, 
our runtime primitives are also applicable in many cases for regular codes. In Section 4, we briefly mention 
other cases when our primitives can be used. In this section, however, we discuss the details of our runtime 
support in context of multiblock and multigrid applications. 

2.1 Multiblock and Multigrid Applications 

For a typical multiblock application, the main body of the program consists of an outer sequential (time 
step) loop, and an inner parallel loop. The inner loop iterates over the blocks of the problem, after applying 
boundary conditions to all the blocks (including updating interfaces between blocks). Applying the boundary 
conditions involves interaction (communication) between the blocks. In the inner loop over the blocks, the 
computation in each block is typically sweeps over mesh in which each mesh-point interacts only with its 
nearby neighbors. Since in these applications, there are computational phases which involve interactions only 
within each block, communication overheads are reduced if each block is not divided across a large number of 
processors. So, blocks may have to be distributed onto subsets of processor space. Since the number of blocks 
is typically quite small (i.e. at most a few dozens), at least some of the blocks will have to be distributed 
across multiple processors. Partitioning of the parallel loop across the block is the source of the coarse-grained 
parallelism for the application. Furthermore, within each iteration of the inner loop fine-grained parallelism 
is available in the form of (large) parallel loops, iterating over the elements of the blocks. 



Now, we briefly discuss the runtime support required for parallelizing multiblock applications. First, there 
must be a means for expressing data layout and organization on the processors of the distributed memory 
parallel machine. We need compiler and runtime support for mapping blocks (arrays) to subsets of the 
processor space. Second, there must be methods for specifying the movement of data required. Two types of 
communication are required in multiblock applications. The interaction between different blocks requires the 
movement of regular array sections. The inner loop involves interactions among neighboring elements of the 
grids. Since blocks may be partitioned across processors, this also requires communication. Third, there must 
be some way of distributing loops iterations among the processors and converting global distributed arrays 
references to local references. 

Multigrid is a common technique for accelerating the solution of partial differential equations. Multigrid 
codes employ a number of meshes at different levels of resolution. Multigrid codes have phases of restriction 
(in which a coarse grid is initialized based upon a finer grid), prolongation (in which a coarse grid is copied into 
a finer grid with non-unit stride and then the other elements on the fine grid are computed by interpolation) 
and sweeps over individual grids. Coarse meshes may be obtained from fine grid by coarsening by the same 
factor or different factors along different dimensions. Accordingly, each grid may be distributed over the entire 
set of processors, or some grids may have to be distributed over parts of the processor space. A particular 
form of multigrid technique is the semi-coarsening multigrid technique [34]. Semi-coarsening multigrid works 
as follows. Starting from the finest grid, coarser grids are generated by coarsening by different factors along 
different dimensions. There may be many grids, having the same number of mesh points, but obtained by 
different coarsening factor along each dimensions (i.e. they may have different shapes). In parallelizing such 
an application, communication costs can be reduced while maintaining good load balance by the following 
mapping scheme. The finest grid is mapped over the entire processor space. Different grids at the same level 
(i.e. having the same total number of mesh points but obtained by different coarsening factors along each 
dimension) are mapped to disjoint parts of the processor space. In Figure 2 we show how the semi-coarsened 
grids are generated and how they can be mapped to the processor space. 

The runtime support requirements for the multigrid codes is as follows. As with multiblock codes, we need 
to be able to map grids over subsets of the processor space. The restriction and prolongation steps require 
regular section moves between grids at different levels of resolution. Again, communication arises because each 
grid may be distributed across multiple processors and computation within each block requires near neighbor 
interactions. Similarly, the interpolation required during the prolongation step also involves interaction among 
neighboring elements. Also, support for distributing loop iterations and transforming global distributed array 
references to local references is required. 

2.2 Multiblock Parti Primitives 

We now discuss the design of our runtime library [39]. Since, in typical multiblock and multigrid applications, 
the number of blocks and their respective sizes is not known until runtime, the distribution of blocks onto 
processors is done at runtime. The distributed array descriptors (DAD) [4] for the arrays representing these 
blocks are, therefore, generated at runtime. Distributed array descriptors contain information about the 
portions of the arrays residing on each processor, and are used at runtime for performing communication and 
distributing loops iterations. We will not discuss the details of the primitives which allow the user to specify 
data distribution. For more details, see [39]. 

As we discussed previously, two types of communication are required in both multiblock and multigrid 
applications. Inter-block communication is required because of boundary conditions between blocks (in multi- 
block codes) and restrictions and prolongations between grids at different levels of resolution (in multigrid 
codes). Since the data that needs to be communicated is always a regular section of an array, this can be 
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Figure 2: Semi-coarsened grids and their mapping to processors 
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handled by primitives for regular section move. A regular section move copies a regular section of one dis- 
tributed array into regular section of another distributed array, potentially involving changes of offset, stride 
and index permutation. Intra-block communication is required because of partitioning of blocks or grids across 
processors. The data access pattern in the computation within a block or grid is regular. This implies that 
the interaction between grid points is restricted to nearby neighbors. The interpolation required during the 
prolongation step in multigrid codes also involves interaction among the neighboring array elements. Such 
communication is handled by allocation of extra space at the beginning and end of each array dimension on 
each processor. These extra elements are called overlap , or ghosts cells [6, 16, 27]. Depending upon the data 
access pattern in a loop, the required data is copied from other processors and is stored in the overlap cells. 

In our runtime system, communication is performed in two phases. First, a subroutine is called to build 
a communication schedule that describes the required data motion, and then another subroutine is called to 
perform the data motion (sends and receives on a distributed memory parallel machine) using a previously 
built schedule. Such an arrangement allows a schedule to be be used multiple times in an iterative algorithm. 
The communication primitives include a procedure Overlap-CelLFillJScked, which computes a schedule that 
is used to direct the filling of overlap cells along a given dimension of a distributed array. Communication for 
filling in the overlap cells has been implemented in other systems for regular computations [6, 16, 27], so we 
will not be discussing the details here. The primitive RegularSection-CopySched carries out the preprocessing 
required for performing the regular section moves. In section 3, we discuss the details of the regular section 
analysis required for efficiently generating the schedule for regular section move. 

The schedules produced by Overlap -CelLF ill Sched and Regulars ection-Copy.Sched are employed by a 
primitive called Data.Move that carries out both interprocessor communication (sends and receives) and 
intra-processor data copying. 

The final form of support provided by the multiblock Parti library is to distribute loop iterations and 
transform global distributed arrays references into local references. Our library distributes loop iterations 
based upon the owners compute rule, which means that a particular loop iteration is executed by the processor 
owning the left-hand side array element written into during that iteration. As we discussed earlier, we prefer 
to generate the Distributed Array Descriptor (DAD) for the arrays at runtime. This means that global indices 
can be translated to local indices only at runtime and not at compile-time. Two primitives, LocaLLowerJdound 
and LocaLUpper .Bound, are provided for transforming loop bounds (returning, respectively, the local lower 
and upper bounds of a given dimension of the referenced distributed array) based upon the owners compute 
rule. Primitives globaLtoJocal and local Jo -global are also available for translating a global index into local 
index and translating a local index on a processor into global index respectively. 

3 Regular Section Analysis 

In this section we discuss the regular section analysis required for efficiently generating schedules for regular 
section moves (i.e. for implementing the primitive Regular JSection-CopySched) . By regular section analysis 
we mean how each processor can determine, for each other processor, the exact parts of the distributed array 
it needs to send and receive, given the source and the destination regular sections in global coordinates. In our 
current system, this analysis is always done at runtime. However, if the distributions of source and destination 
distributeed arrays and description of source and destination regular sections are available at compile-time, 
then this analysis can be done at compile-time as well. In separate works, Chatterjee et al [7], Stichnoth [37] 
and Gupta et al [17] have developed compile-time methods for analyzing and generating communication 
associated with HPF’s forall statements and/or F90 style array expressions. While their solutions work for 
the general case when the data distributions are block-cyclic, their methods require that the data distribution 
be known at the compile-time and the exact description of the statement be available in terms of compile-time 
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constants. We are interested in techniques which can be used at runtime and are specialized towards the 
particular communication patterns associated with multigrid and multiblock applications. Also, since these 
applications typically need block distribution, we restrict to describing our analysis when the data is block 
distributed along each dimension. 

We assume that the array indices always start from 0 (as in the C programming language). The processors 
owning source (or, destination) array can be viewed as forming an r-dimensional virtual processor grid. A 
processor p in this processor grid has coordinates ■ ,Pr}* We assume that the numbering starts from 

zero in each dimension in the processor grid. 

The source regular section, denoted by S, is part of the source distributed array s. 

S = {(sJo i : sJii \ : sstr i), (sJo2 : s_/iz '2 : sstr2), . . . , ( sJo r : sJii r : s_sfr r )} 

sJoi, sMii and sstri are respectively the lowest index, highest index and the stride along the i th dimension 
(in global indices). The regular section S defines a set of array elements. An array index e, along the i th 
dimension is said to be a part of the regular section iff 3/, (an integer) s.t. 


e,- = sJoi + * sstri, 


(3.0.1) 


where, 


0 < /, < 


sJiii — sJoi 

S-StVi 


An array element whose indices are (e\ ,^ 2 , . . . , e r ) belongs to the regular section S iff, 

Vi, 1 < i < r, the array index e,- along the i th dimension belongs to the regular section S. 

We will use this format to describe all regular sections. The destination regular section, denoted by V, is 
a part of the destination array d. 


V = {(dJo i : dJii\ : dstr\), (dJo2 : dJii 2 : dstr ?), . . . , ( dJo r : d.hi r : d_s 2 r r )} 

Regular section moves can involve index permutations. We denote by im(i) the destination dimension 
which is accessed by the same loop index as the i th source dimension. 

For each processor owning part of the source regular section, we want to determine the set of local elements 
that it will be sending to each processor owning part of the destination regular section. We call these sets of 
elements send sets. Similarly, for each processor owning part of the destination, we want to determine the set 
of local elements that it will be receiving from each processor owning part of the source. We call these sets of 
elements receive sets. Here we just discuss the analysis that a particular source processor does to compute the 
send sets. The analysis for determining the receive sets is completely analogous and is therefore not described. 

The steps we follow for computing the send sets are as follows. For a processor p , we determine the part 
of regular section S that it owns, that is, we restrict the section S on the processor p (which is denoted by 
S f (p)). Next, we take a transformation of the section S'(p)) to map it from the source regular section S 
to the destination regular section V. The resulting section is denoted by V f {p). We next determine the set 
of destination processors which own part of the section P ; (p), i.e. the destination processors to whom the 
processor p will be communicating. For each such processor q , we restrict the section 'P'(p) to determine the 
part that q owns, calling it V N (p,q). In the last step, the section T> n {p,q) is mapped back to the source, the 
resulting section is denoted by S ,f (p,q). S ,f (p, q) is the send set that the source processor p will be sending to 
the destination processor q. 

We now present the details of the steps mentioned above. We consider a particular processor p which owns 
part of the source array s, of which the regular section S is a part. Let Uof(p) and lhif(p) be the lowest 
and the highest points along the i ih dimension (in global indices) that the processor p owns. Since the data 
distribution is block, the processor p owns a contiguous chunk of data from llof(p) to lhif(p). 
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3,1 Restricting S to the Processor p 

We now compute the part of the regular section S that the processor p owns. This is denoted by «S'(p). 
Through-out our discussion, all regular sections will always be described in global indices. Given the global 
coordinates, any individual processor can always determine the corresponding local (on processor) indices. 


S'{p) = {(s-/oi(p), sM\(p), sstr\(p )) } .... (8Jo* r (p), sJii' r (p) } sMr' r (p))} 

where, 

sJo'i(p) 

s-hi'i(p) 

sstr'iip) 

Since the data is block distributed, sstrl(p) is sstri. sJoJ(p) is the first index along the i th dimension 
which is part of the regular section S and is owned by the processor p. In the calculation of sJo'^p) there are 
two cases, depending upon whether sJo{ > Uof (p) or sJoi < llof(p). If s_/o, > llof (p) then the index s_/o, 
is on the processor p. Therefore, sJo'^p) = . Alternatively, if sJo, < //of(p), then the expression for 

sJo^p) reduces to s_/o, + \(llof(p) — sJoi)/sstri\ • s_s<r t . This is the first index after //of (p) which is part 
of the regular section. Note that s_/itj(p) is not necessarily a member of the regular section S. It just needs 
to be greater than the last member of regular section on the processor q , so that the loop accessing successive 
indices in the section terminates correctly. 

If the processor p does not own any part of the regular section along the i ih dimension, then the above 
expressions will give a value of s_/oJ (p) which is greater than the value of s_hij(p). In general, if 

3 i, 1 < i < r, s.t. sJoJ(p) > sJii'^p) 

then the processor p does not own any part of the regular section S . 


.Mr, 

J s.stri 1 

(3.1.1) 

min(//iif (p), sJiii ) 

(3.1.2) 

sstri 

(3.1.3) 


3,2 Mapping $'(p) to the Destination 

Next, we determine the corresponding section in the destination array (i.e. part of the destination regular 
section that will be received from the processor p). This is denoted by X>'(p). 

V(p) = {(dJo\(p), dJii\ (p), dstr[ (p)), . . . , (dJo' T (p), dJii' r (p), dMr' r (p))} 


where, 


j 

= im(i) 


(3.2.1) 

dJo'jip) 

= dJoj + 

sMM-sM 

sstri 

(3.2.2) 

dJii'j (p) 

— d-loj -f- 

sstri J 3 

(3.2.3) 

cLstr'(p) 

= dstrj 


(3.2.4) 


Since the array is distributed by blocks, cLstr' (p) is djstrj. The expressions for dJo f -(p) and d_/u'(p) 
follow from finding the indices in the destination regular section which correspond to the indices sJoJ(p) and 
s_/iij(p), respectively, in the source regular section. 
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3.3 Restricting 2^(p) to Processor q 

We first determine the set of processors to which the source processor p will send data. The processors owning 
parts of the destination array form an r-dimensional virtual processor grid. A processor q , owning part of the 
destination array, has coordinates {gi,g 2 , * • >9r} in this processor grid. We assume that each processor owns 
sizt{ indices along the i ih dimension. 

We denote by g_mm,(p) and qjmaxi{p) the lowest and highest coordinates along the i th dimension of the 
processors which own part of the regular section £>'(p). 

g_mm,(p) ~ 
qjmaxi(p) = 

A processor q having coordinates {gi, g 2 , . . . , g r } will receive data from the source processor p iff 

Vi, 1 < i < r, qjmini(p) < g,- < g_max,*(p) 

Consider a particular processor q which will receive data from the source processor p. Suppose that the 
start and end points along the i th dimension on this processor are llof(q) and lhif(q) respectively. We denote 
the part of the destination regular section that the processor q will receive from the processor p by V ,f {p ) q). 

V\p, q) = {( dJo"{p , g), dJii\'(p, g), dstr’^p, q )), . . . , (<f Jo"(p, g), dJi%(p, g), djstr^p, g))} 

where, 


<Uo"(p , 9 ) 

= dAo'i + 

max(0,llo?(q)- dJo'^p))] 

dstri 

dstri 

(3.3.3) 

d-hi"(p, 9 ) 

= min(lh 

if ( 9)1 dJli'i(p)) 

(3.3.4) 

d_str,''(p, 9 ) 

~ dstr t 


(3.3.5) 


The reasoning behind the correctness of the above expressions is the same as that used in determining S ' 
from 5, as discussed in Section 3.1. 



3.4 Mapping T>"(p, q) to the Source 

Next, we determine the equivalent part of the regular section ^(p, g) on the source side (i.e. the part of the 
source regular section which the processor p sends to the processor g). We denote this by «S /; (p, g). 

S"{p, q ) = {(sJo"(p, q), sJii"{p , q), s^str"(p, 9 )), . . . , ( sJo"(p , q), sJii"(p , q), sstr"(p, 9 ))} 

where, 


j = im(i) 
sJo"(p,q) = sJoi + 

s-hi"(p , 9 ) = sM + 

s.str"(p, q) = sstri 


dJo"(p, 9) ~ 

dstrj 

dJiij(p , g) — dJoj 
dstrj 


sstri 
• sstri 


(3.4.1) 

(3.4.2) 

(3.4.3) 

(3.4.4) 


The reasoning behind the correctness of the these expressions is the same as that used in determining T>'(p) 
from S'(p) in Section 3.2. 
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3.5 Discussion 


All the calculations described above are performed by the processor p locally and do not involve communication 
with any other processor. Therefore, send and receive sets can be generated efficiently. Based on the calculation 
of S f \ the processor p knows the contents of the message that it must send to processor q . However, when 
processor q receives this message, it does not have any information about which local memory locations each 
element of the message must be copied into. To facilitate this, each destination processor computes the set 
of (local) elements that it will receive from each source processor. The calculations for computing these 
receive sets are completely analogous to the computations for the send sets. Therefore, we do not describe 
the computation of the receive sets here. The source processor p always sends the set of elements it needs to 
send to the processor q in a single message, packed in the column major fashion. Processor q can then use the 
receive set information to copy the elements in the received message into the appropriate local elements. 

An alternative to this scheme is that the message sent by the source processor p also contains information 
about what local memory location at the destination processor q each of elements packed in the message needs 
to be copies to. The destination processor q can then copy elements of the message into its local elements based 
upon this information. This approach does save some computation at the destination processors. However, 
the size of the messages increases significantly because of the extra information that needs to be sent. In 
our implementation, we have chosen to compute both the send and receive sets, since on current distributed 
memory machines, this is less expensive than communicating the receive set information. 


3.6 Example 

Consider a regular section move that involves a source array of size 100 * 100 and a destination array of size 
50 * 100. The source array is block distributed over a 2 * 2 virtual processor grid and the destination array 
is block distributed over a 4 * 1 virtual processor grid. The source and the destination regular sections are: 
S = {(10 : 60 : 2), (10 : 70 : 3)} and, V = {(10 : 30 : 1), (5 : 80 : 3)}. The first dimension of the source regular 
section is aligned to the second dimension of the destination regular section and the second dimension of the 
source regular section is aligned to the first dimension of the destination regular section i.e. im(l) = 2 and 
im(2) = 1. 

We consider the source processor p with coordinates {1,0}. The part of the global array that this processor 
owns is llof(p) = 50, //iif(p) = 99, //of(p) = 0 and /hif(p) = 49. 

The part of the source regular section that processor p owns («S'(p)) is given by 

S'(p) = {( 50: 60: 2), (10: 49: 3)} 

The corresponding section on the destination side (£>'(p)) is given by 

D'(p) = {(10: 23:1), (65: 80: 3)} 

Next, the processor p determines the set of destination processors with whom it will be communicating. 
We have for the destination array, size i = 50 and size-i — 25. 

This gives, p_mini(p) = 0, p-maxi(p) = 0, p_miri 2 (p) = 2, and p-maa^p) = 3. The destination 
processors the source processor p will communicate with are the ones with grid coordinates {0, 2} and {0, 3}. 
Consider the destination processor q with coordinates {0,3}. The part of the destination array that processor 
q owns is given by llof(q) = 0, lhif(q) = 49, Ho^iq) = 75 and Ihi^iq) = 99. 

The part of regular section which the processor p will be sending to the processor q (2>"(p,g)) is given by 

0>,g)={(lO:23: 1), (77 : 80 : 3)} 


10 



The corresponding source section (S"(p, g)) is now given as 


S"(p, q) = {(58 : 60 : 2), (10 : 49 : 3)} 


4 Compiler Support 

In this section we first discuss the additional functionality required in the current version of HPF to support 
multiblock and multigrid codes. We describe how a compiler can analyze the data access patterns associated 
with a loop, to recognize communication patterns which can be handled using the runtime primitives for 
multiblock problems. We then describe the compiler transformations for generating the calls to these runtime 
primitives. We also briefly discuss how loop iterations are distributed to achieve parallelism. 

4.1 Language Support 

The current version of HPF does not support all the functionality required for multiblock and multigrid ap- 
plications. In multiblock problems, the problem geometry is divided into a number of blocks of different sizes. 
As we have discussed in the previous sections, each of these blocks needs to be distributed onto a portion 
of the processor space. Similarly, in multigrid codes, communication overheads can typically be reduced by 
distributing each coarse grid over a part of the processor space [35]. The current version of HPF does not 
provide any convenient mechanism for distributing arrays (or templates) onto a part of the processor space. 
We therefore need additional functionality for conveniently distributing arrays onto part of the processor space. 
In HPF, the programmer declares an abstract processor space by using the processor directive: 

!HPF$ PROCESSORS P(N) 

In general, the abstract processor space can have any number of dimensions. To support block structured 
applications, we need to be able to specify processor subspaces. We declare a processor subspace as follows: 

!HPF$ PSUBSPACE PI IS P(LB:UB) 

The above directive states that PI is the part of the processor space P which starts at processor LB and 
ends at processor UB. In addition, if the processor subspace PI is created from the processor space P, then 
PI must have the same number of dimensions as P. Since the sizes of the blocks are, in general, not known at 
compile-time, the subspace directive must be executable, so that the parameters do not have to be compile- 
time constants. Once a processor subspace has been declared, arrays or templates can be distributed onto it. 
For example, 

!HPF$ TEMPLATE T(100,100) 

!HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO PI 

Similar functionality is available in Vienna Fortran [41], where a distribution can be mapped to a processor 
reference. We prefer to explicitly name the processor subspace since it makes it easier to detect at compile-time 
which arrays or templates have been mapped to the same processor subspace, even if the exact size of the 
subspace is not available as c.ompile-time constants. 

With the block distributions supported in the current version of HPF, the entire array gets distributed 
uniformly across the processors of the distributed memory parallel machine. This may not be ideal for load 
balancing for many applications. While the programmer may declare a large array, not all the elements of the 
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array may be actual mesh points participating in computation. Some of the array elements at both ends of 
each dimension may be used for participating in exchanges between blocks. We refer to such array elements 
as external ghost cells . For example, the actual declared arrays for a given block may be 52 x 12 x 12, with two 
external ghost cells at the beginning and end of each dimension. This means that the actual mesh representing 
the computation is of size 48 x 8 x 8. It is these mesh points which must be distributed evenly across the 
processors onto which the block is distributed, so that the computation load will be evenly balanced. The 
external ghost cells at both ends of each dimension are then stored at the first and last processor along that 
dimension in the processor space. For example, if an array with 8 elements, plus two external ghost cells on 
each end (for a total of 12 elements), is distributed on 4 processors, we would like to store 2 mesh points on 
each processor along that dimension. The first and last processors can then store the external ghost cells at 
the beginning and end, respectively. This results in a much better load balance than simply distributing 3 
array elements onto each processor (which will result in the first and last processors having only 1 real mesh 
point and the intermediate processors having 3 real mesh points each). 

The current version of HPF does not provide any mechanism for specifying external ghost cells. We need 
additional functionality in the align statement to express them. We do this by explicitly specifying the number 
of external ghost cells at the beginning and end of each dimension: 

!HPF$ DIMENSION A(105,105) 

!HPF$ ALIGN A(ij) WITH T(i:2:3 j:2:3) 

This example says that an array of size 105x105 is aligned along a template of size 100x100, with 2 ex- 
ternal ghost cells at the beginning of each dimension and 3 external ghost cells at the end of each dimension. 
If the template T is distributed by blocks onto a two dimensional processor space, A(3:102,3:102) also gets 
distributed in the same fashion. Note that our purpose here is not to introduce new syntax but to achieve 
the additional functionality that we need. We believe that this functionality will be added, in some form, in 
a future version of HPF. 

4.2 Identifying Communication Patterns 

In this subsection we discuss how the compiler identifies the communication patterns which can be handled 
using the runtime support for multiblock problems. Note that our purpose is not to provide a general framework 
for compiling /ora// statements; we are only interested in recognizing the patterns that can be handled efficiently 
using the primitives we have developed. 

While we have designed the runtime support with multiblock and multigrid codes in mind, the runtime 
primitives can also be used to efficiently handle communication for many other types of applications. Regular 
section move primitives can be used for handling the communication required when a distribution of an array 
is changed (using the redistributed statement of HPF [12]. They can also be used to handle the communication 
required for filling ghost cells when the data distribution is cyclic or block-cyclic. The primitives can also be 
used for handling communication in forall loops and array expressions in many regular applications, especially 
when strides are involved. 

We do not consider applications in which indirection arrays may need to be analyzed to identify commu- 
nication patterns. The irregular communication arising from use of indirection arrays can be handled using 
the Parti primitives for irregular problems [10], which have also been integrated with compilers for HPF-style 
languages (including the Rice University Fortran 77D compiler [18] and the Syracuse University Fortran 90D 
compiler [4]). F90D and HPF also provide a number of intrinsic functions (such as reduction, spread, etc ). 
We assume that if a computation can be done using these intrinsics, it is either written this way by the 
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programmer or is recognized by the compiler in an early phase of the compilation. 

HPF allows multiple statement forall loops and array expressions for expressing parallelism. We restrict our 
discussion to the problem of analyzing a single statement forall loop for communication patterns. A multiple 
statement forall loop is just like a single statement forall loop, with the multiple assignment statements all 
having the same loop header. The same analysis can be done for each assignment statement within the forall 
loop. The array expressions provided by HPF can always be translated into equivalent forall loops. 

We classify the data access patterns associated with a forall loop as being one of three kinds: 

• Completely regular (not involving any communication). 

• Ones that can be handled by filling in overlap cells. 

• Ones that requires regular section moves. 

Consider any forall statement with array expressions involving an array A in the left hand side and an' 
array B as one of the arrays on the right hand side. The form of the forall statement is assumed to be as follows: 

forall ( 2 '] — lo\ . hl\ . st 1 , . . . , IfYi — lOm ■ him • slm ) 

A(/i , f% , . . . , fj ) — ‘ , <72 j • • ■ ) <7n) 

The u , (& = \..m) are the loop variables associated with the forall statement, /o*, /n* and st * are 
respectively the lower bound, upper bound and the stride for each loop variable. For the left hand side array 
A, / 1 , / 2 , ,fj are the subscripts. Similarly, for the right hand side array B, g 1 , g 2 , .. , g n are the subscripts. 
The form of the array subscripts / and g is assumed to be: 

fk = cl** 1* + d\k 

9k — c2j c i2j c T d2k 

Here, 2*1* and i2* are loop variables. If a subscript is a loop invariant scalar, then we say that the loop 
variable tl* (or, i*2k ), is (f) and cl* (or, c2/t) is 0. cl*, c2*, ofl* and <f2* may be expressions, but we assume 
that they do not involve any loop variable. Our primitives are not applicable for cases in which multiple loop 

variables are associated with a particular array subscript or when the same loop variable appears in more than 

one subscript for a particular array or when a subscript is a higher order function of a loop variable. Such 
cases can be handled by using the Parti primitives for irregular problems. Also, the HPF specification allows 
the lower bound, upper bound and stride expressions for each loop variable to be evaluated in any order. 
Consequently, the lower bound, upper bound and stride for any loop variable are not allowed to be a function 
of any other loop variable. It is possible, in general, that a loop variable may appear only in the right hand 
side array or only in the left hand side array. If a particular loop variable appears only in the right hand side, 
this represents successive overwrites on the same memory location of the left hand side array. Such code is not 
likely to appear in practice and therefore, we do not consider this case. If a particular loop variable appears 
only in the left hand side array, this represents a spread operation. We assume that it is written using the 
intrinsic spread operation, and is not a part of the forall. 

Depending upon how the arrays A and B are distributed and aligned with respect to each other, we consider 
three different cases. These are: 

Case I: Arrays A and B are aligned to different templates. 

Case II: Arrays A and B are identically aligned to the same template. This case also requires that A 
and B are arrays of identical shape and size, i.e. having the same number of dimensions and the same 
size in each dimension. 
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Case III: Arrays A and B are aligned to the same template, but with a different stride, offset and/or 
index permutation. This means that the the arrays A and B are mapped to the same processor subspace, 
but each in a different manner. 

We now discuss how the data access pattern associated with each of these cases is analyzed. The transfor- 
mations required for generating calls to the runtime primitives are discussed in Section 4.3. 

4.2.1 Case I 

Since the arrays are aligned to distinct templates, the communication is always handled using the regular 
section move primitive from the runtime library. We expect that if a user has declared distinct templates then 
they are either distributed over different processor subspaces, or have a different number of distributed dimen- 
sions. Therefore, there is no regularity in the communication associated with a forall statement containing 
references to such arrays. 

It is possible that a programmer may create more than one template with the same number of distributed 
dimensions, distributed over the same processor subspace. We can extend our analysis to consider the processor 
subspace over which distinct templates are distributed in determining any regularity in the communication 
required. However, we do not discuss this possibility here. 

4.2.2 Case II 

The data access patterns associated with this case may be completely regular, or may require the overlap cells 
to filled in, or may require a regular section move. 

Let DD(A) denote the set of dimensions of the array A which are distributed. Under the assumptions for 
Case II, DD(B) = DD(A). In terms of the form of the forall statement and the array subscripts that presented 
in Section 4.2, the condition for the communication associated with the forall to require a regular section move 
is : 

3 j e DD(A) s.t. 

1. ilj i2j , or, 

2. c\j / c2j , or, 

3. dlj ^ d2j and either dlj or d2j is not a compile-time constant, or, 

4. ilj = <i> } i2j = 4> and dlj ^ d2j. 

The first condition states that there is loop index permutation. In that case, a regular section move will be 
required. The second condition states that, along the j th dimension, the elements of the arrays A and B are 
being accessed with different strides. Again, this case will require a regular section move. The third condition 
corresponds to the fact that there are non-constant offsets. If there are constant offsets, then only the overlap 
cells need to be filled in. For overlap cells, space needs to be allocated at compile-time, so the number of overlap 
cells must be known at the compile-time. If the offsets are not compile- time constants, then we use a regular 
section move to handle communication. This situation can also be handled by shifts into a temporary array [4]. 
The fourth condition says that along dimension j a loop variable does not appear in either the left hand side or 
the right hand side index and the loop invariant scalars are different. This represents a copy from one location 
to another, but because of the loop variables associated with other dimensions, will typically require a regular 
section of data to be moved. Since the distributed array descriptor is not available at compile-time, it cannot 
be determined at compile- time whether this data move will require any in ter processor communication. So we 
handle this kind of data move using the regular section move primitives we have already discussed. 

The data access pattern requires filling in overlap cells, if the following condition holds: 
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Arrays A and B are aligned identically 


L.H.S. 

Expression 

R.H.S. 

Expression 

Regular Section 
Move Required 

Overlap Cell 
Fill Required 

A(ij) 

B(j+2,i+l) 

YES 

NO 

A(ij) 

B(2*i j) 

YES 

NO 

A(ij) 

B(i+n 1 j+2) 

YES 

NO 

A(i j) 

B(i+lj+2) 

NO 

YES 

A(ij) 

B(i j) 

NO 

NO 


Figure 3: Analyzing communication for Case II 

1. A regular section move is not required and 

2. 3j e DD(A) s.t. dlj £ d2j. 

The second condition states that there is a difference in the offsets along some (distributed) dimension. 
Overlap cells must be filled along each dimension in which there is a difference in the offsets. In Figure 3 we 
show examples for the different possibilities within case II, for identically aligned two dimensional arrays A 
and B. 

4.2.3 Case III 

In this case, arrays A and B are aligned to the same template (T), but in different fashions, We consider 
only the cases when A and B are aligned to T in such a way that none of the dimensions of either A or B is 
replicated. In this case, the number of distributed dimensions of A and B would be identical, and will be equal 
to the number of distributed dimensions of T. Consider any distributed dimension j of A (i.e. j £ DD(A)). 
We use AD(A,j) to denote the dimension of the template T along which the distributed dimension j of the 
array A is aligned. We use map(j) to denote the dimension of the right hand side array B which is aligned to 
the same dimension of the template T as dimension j of the array A. Formally, 

map(j) = k 3 /s.t. AD(A,j) — l A AD(B,k) = l . 

Since each of the dimensions of A and B are distributed along exactly one dimension of the template T (as 
required by HPF), map(j) is defined and is unique for each distributed dimension of A. 

For the purpose of the discussion, we assume that the arrays A and B are aligned as follows: 

!HPF$ ALIGN A (k u . . . , kj , . . . Jb,) WITH T(hl u . . . , hly : ExtJow 1, : EztJiighl j9 ...,hl p ) 

!HPF$ ALIGN B (Jfci, .Ar„) WITH T(/i2 1} . . .,/i2j» : EztJow2j : Ext JxigWlj , . . . , h2 p ) 

where; 

P - \DD(A)\ 

f = AD(AJ) 
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j" = AD(BJ) 
hljt = a\j * kj + 61j 

h2j" = a2j * kj -f 62j 

In the above, dimension j of the array A is aligned with dimension j f of the template T. Similarly, dimension 
j of the array B is aligned with dimension j n of the template T. Ext Jowl j and Ext Jiighlj are, respectively, 
the number of external ghost cells at the beginning and end of dimension j of the Array A. Similarly, ExtJow2j 
and ExtJiigh2j are the number of external ghost cells at the beginning and end, respectively, of dimension j 
of Array B. 

If we view the computation as accessing elements of the template T, then the effective offset for the left 
hand side array reference along dimension j of the array A is dlj*alj + b\j — Ext Jowl j . Similarly, the effective 
offset for the right hand side array reference along dimension j of the array B is d2j * a2j + b2j — ExtJow2j . 
The data access pattern ii) the forall loop will require a regular section move if the following condition holds: 

3 j € DD(A) s.t . 

1* ^map(j) > or 

2. clj * alj ^ c2 map(j) * or 

3. dlj ♦ ctlj 3~ 61 j E xt Jowl j ^ d2 rna p^j'^ ♦ a2, Tl ap^j^ 3- b2f na p^j^ E xt Jow2fyiap^j ^ 

and either of these effective offsets is not a compile- time constant, or 

4. ilj — <|6, %2j — (j) and blj Ext Jowl j ^ ) ExtJow2 ma p^j^. 

As in Case II, the first condition states that there is loop index permutation. The second condition implies 
that there is a difference in the effective stride taken along any dimension. The third conditions says that 
the offsets may be distinct and one of them is not a compile-time constant. The fourth condition says that 
a rectilinear section of data needs to be moved along a certain dimension. Suppose that ilj = ijti and 
i2map(j) = *Jk2* If we view the loop iterations as accessing elements of the template T, clj * stk i * alj is the 
effective stride of the subscript on the left along dimension AD(A ) j) and similarly c2 map (j) * stk 2 * a< ^map(j) 
the effective stride of the subscript on the right hand side along dimension AD(B y map(j)). By the definition 
of map(j), AD(A,j) = AD(B,map(j)). If ilj and i2 ma p(j) are identical, then Arl = k2. So, the required 
condition for the effective stride for left and right side to be identical is clj * alj = c2 map (j) * a2 map (j). 

The data access pattern will require filling in overlap cells if the following condition holds: 

1. A regular section move is not required and 

2. 3 j E DD(A) s.t. 

dlj * alj + 61 j — Ext Jowl j ^ d2 ma p(j) * a2 ma p(j) 3- b2 ma p(j^ ExtJow2 ma p(j^ 

The first condition says that we have not already decided that a regular section move was required. The 
second condition says that the effective offsets are not identical. In Figure 4, we give several examples showing 
the results of the analysis for Case III. 

4.3 Generating calls to the runtime library 

Once the nature of the communication required has been identified, the compiler must insert the appropriate 
calls to the runtime primitives. We first discuss how the calls are made to the routines for filling in ghost ceils. 
We discuss this in the context of Case III from the previous section, since Case II is really a special case of 
Case III. 
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ALIGN A(i j) WITH T(i j) 


ALIGN B(i j) WITH T(2*j,i+3:2:1) 


L.H.S. 

Expression 

R.H.S. 

Expression 

Regular Section 
Move Required 

Overlap Cell 
Fill Required 

A(ij) 

B(ij) 

YES 

NO 

A(ij) 

B(j,i) 

YES 

NO 

A(2*i j) 

B(j,i) 

NO 

YES 

A(2*ij) 

B(j,i+3) 

NO 

YES 

A(2*i j) 

B(j,i+1) 

NO 

NO 


Figure 4: Analyzing communication for Case III 

We identify each distributed dimension j of the array A for which 

d\j * otlj -f- 61 j Ext — lowlj ^ d2 ma p( jij * g 2 map(j) ~F b2 m ap(j) Ext-low2 ma p^jy 

One call to the schedule building primitive Overlap -CelLFillSched and one to the data moving primitive 
Data-Move is inserted for each such dimension. Since all computations are distributed using the owner 
computes rule, overlap cells are filled in for the right hand side array B. For the Overlap-Cell-FillSched 
call, the dimension of the move is map(j) and the number of overlap cells to be filled in is rf2 map y) *a2 map ( ; ) + 
"* _ Ext —i low ^ dlj ♦ a lj j Ext—lotv\j. 

The schedule building primitive is called with the Distributed Array Descriptor (DAD) of the array as a 
parameter. The actual array storage location need not be specified. A call to Data-Move is then made which 
uses the previously built schedule to copy the data. 

We now discuss how calls to the primitive for moving regular sections are inserted. If there is more than 
one array on the right hand side, then the analysis described in Section 4.2 is done for each such array. For 
each of the right hand side arrays which requires a regular section move, a temporary array is declared and 
a regular section move is done from the right hand side array into the temporary array. If there is only one 
array on the right side (i.e. the forall loop represents only a copy and does not have any computation), then 
the regular section move is performed directly from the right hand side array to the left hand side array. 

The parameters of Regular-section_movesched are assigned as follows. In the forall loop, ij is the j th 
loop variable. The total number of loop variables is m. For each of the loop variables, we identify the 
dimensions corresponding to the subscripts of A and B where they appear. Srcdim(j) and Destdim(j) denote 
the dimensions of the source array B and the destination array A (or a temporary array) that correspond 
to the j th dimension of the regular section being moved. Note that Srcdim(j) is not necessarily j since the 
forall loop allows arbitrary permutation among dimensions. If i 1 * i = ij and 2*2*2 = ij , then SrcDimQ) is 
assigned k\ and DestDim(j) is assigned k*2. The remaining elements of Srcdims and Destdims are assigned 
the remaining dimensions of A and B whose subscripts are loop invariant scalars (the exact ordering is not 
important). 

SrcLos(k) and SrcHis(k) denote the start and end points, respectively, along each dimension of the source. 
If 2 * 2 * = <f), then, SrcLos(k) = SrcHis(k) = d2*. Otherwise, if 2*2* = ij, for some j, then, SrcLos(k) = 
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C ORIGINAL F90D CODE 

C Arrays A, B and C are distributed identically 

FORALL (i = 1:100 j = 1:100) A(ij) = B(i+lj) + C(ij) 

C TRANSFORMED CODE 

Dim = 1 

No.Of-Cells = 1 

sched = O verlap -Cell Jill JSched( DAD, Dim, No_Of_.Cells) 

C DAD is distributed array descriptor for A, B and C 
C i is dimension 1, j is dimension 2 

Call Data_Move(B, sched, B) 

LI = Local_Lower_Bound(DAD,l) 

L2 = Local-Lower Jknind(DAD, 2) 

HI = LocaLUpper_Bound(DAD,l) 

H2 = LocaLUpper_Bound(DAD,2) 

do 10 i = LI, HI 
do 10 j = L2, H2 

10 A(ij) = B(i-f 1 j) 4- C(ij) 


Figure 5: Overlap cell fill and loop bounds adjustment example 

c2 fc * (loj) + d2 k and similarly, SrcHis(k) = c2* * (hij) + d2 k . For the destination, the low and high indexes 
are computed in the same manner. 

SrcStr(k) denotes the stride of the move along each dimension of the source. If i2 k = <f>, then, SrcStr(k) 
doesn’t matter. Otherwise, if i2 k = ij , for some j , then, SrcStr(k) = c2 k * stj . For the destination, the strides 
are computed in the same manner. 

4.4 Distributing loop iterations 

Once the calls have been inserted for communicating the required array elements, the loop iterations must be 
distributed among the processors. As we stated earlier, this is done using the owner computes rule. Since the 
distributed array descriptors are built at runtime, it is not possible to compute the local loop bounds on each 
processor at compile-time. 

Consider any loop variable ij, Let i2 kx = ij. The loop accesses elements ranging from cl H *loj + dl kl to 
cl*i *hij +dl*i- We partition the loop based upon the portion of the distributed arrays that are owned by a 
given processor. This is done by inserting runtime calls to the the library primitives LocaLLower-Bound and 
LocaLUpper-Bound. Note that for arrays which are not in canonical form (i.e. where cl ; ^ 1 or dlj ^ 0 for 
some j ), we can still partition the loop based upon the owners compute rule. Consequently, we never need 
to scatter any data after the loop has been executed. 

In Figure 5 we show an example of how the calls to primitives for filling in overlap cells are inserted by the 
compiler. In Figure 6, we show how the compiler inserts calls to the primitives for moving regular sections. 
In both examples, the transformed code containing the calls to the runtime library will run as SPMD code 
on each processor of the distributed memory parallel machine. Note that in the compiler generated code, 
schedule building primitives will be called every time any forall loop requiring communication is executed. 
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C ORIGINAL F90D CODE 

C Arrays A, B are distributed identically 

forall (i = 1:100:2 j = 1:50) A(ij) = B(2*j,i) 

C TRANSFORMED CODE 


NumSrcDim = 

2 

Num DestDim = 

2 

SrcDim(l) = 

2 

DestDim(l) = 

1 

SrcDim(2) = 

1 

DestDim(2) = 

2 

SrcLos(l) = 

2 

DestLos(l) = 

1 

SrcLos(2) = 

1 

DestLos(2) = 

1 

SrcHis(l) = 

100 

DestHis(l) = 

100 

SrcHis(2) = 

100 

DestHis(2) = 

50 

SrcStr(l) = 

2 

DestStr(l) = 

2 

SrcStr(2) = 

2 

DestStr(2) = 

1 

Sched = Regular JSection-MoveJSched(DAD,DAD,NumSrcDim,NumDestDim ) 

SrcDim, SrcLos, SrcHis, SrcStr, 

DestDim, DestLos, DestHis, DestStr) 

Call Data_Move(B,Sched,A) 


Figure 6: Regular section move example 
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The hand coded version can build a schedule once and reuse it in subsequent iterations. Similarly, in the 
compiler generated code, runtime calls to the loop bound adjustment primitives will be made each time a loop 
is executed. The hand coded version can reuse the adjusted bounds over the multiple time steps, and also 
for multiple loops that have the same loop bounds. These two factors may cause compiler generated code to 
perform worse than hand parallelized code. 

5 Experimental Results 

In this section we present experimental results to demonstrate the efficacy of our approach. We are interested 
in two different factors, performance of the library primitives and the effectiveness of the compiler. We study 
the first factor by measuring the runtime overhead incurred in using our library primitives as compared the 
bare cost of communication associated with the best possible hand parallelized codes. We study the latter 
factor by comparing the performance of compiler parallelized codes with the codes parallelized by manually 
inserting calls to the library functions. We also study the effect of data distribution on the performance of 
these codes. 

We have experimented with two major codes: a template from a multiblock Navier Stokes’ solver and a 
semi-coarsening multigrid code. We have parallelized a template from a multiblock computation fluid dynamics 
application that solves the thin-layer Navier-Stokes equations over a 3D surface (multiblock TLNS3D), using 
our prototype Fortran 90D compiler. The multiblock TLNS3D code we are working with was developed by 
Vatsa ei ai [40] at NASA Langley Research Center, and consists of nearly 18,000 lines of Fortran 77 code. 
The template, which was designed to include portions of the entire code that are representative of the major 
computation and communication patterns of the original code, consists of nearly 2,000 lines of F77 code. We 
have also worked with a semi-coarsening multigrid code [35]. This has nearly 2,500 lines of F77 code. In all 
the experiments described in this section, performance data is presented starting from the minimum number 
of processors which provided sufficient memory for executing the program up to 32 processors. 

5.1 Overhead of Primitives 

We are interested in evaluating the performance of a code parallelized using our primitives as compared the 
performance of the “best hand-parallelized” code. By hand parallelized code, here we mean parallelized by 
inserting calls to the communication routines provided by the distributed memory machine and not using our 
library routines. Note that, to the best of our knowledge, no complete block structured code has yet been hand- 
parallelized on a distributed memory parallel machines. The reason is that, for an application programmer 
parallelizing such a code with hand, it is very difficult to analyze the exact communication required in these 
codes and then be able to use the communication routines available on the parallel machine to handle it 
efficiently. The use of library primitives involves runtime overheads because of generating schedules, overhead 
of copying data to be communicated into buffer at source processors, and similarly the overhead of copying 
the received data into appropriate memory locations at the destination processors. The possible advantage (in 
terms of efficiency) of the library primitives is that, for each invocation of a data-move, each processor sends at 
most one message to each other processor. It may, in general, be very difficult for an application programmer, 
parallelizing the code by hand, to do such message aggregation. However, the best performance that an 
application programmer can ever achieve will only have the cost of actual communication and computation, 
assuming that messages have been aggregated to reduce the effect of communication latencies. We will 
study the overheads incurred in a code parallelized using our primitives as compared to the best possible 
performance of hand parallelized code, which incurs the minimum communication costs (assuming maximum 
message aggregation). 
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ONE BLOCK: 49 X 9 X 9 Mesh (50 Iterations) 


Number of 
Processors 

Compiler 

Parallelized 

Hand 

Parallelized F90 

Hand 

Parallelized F77 

4 

6.99 

6.88 

6.20 

8 

4.17 

4.06 

4.00 

16 

2.47 

2.35 

2.28 

32 

1.55 

1.45 

1.41 


Figure 8: Performance comparison for small mesh, one block (in seconds) on iPSG/860 

We consider a simple code executed on 2 processors in which a regular section move involves moving data 
from processor 0 to processor 1. We vary the number of bytes involved in the regular section moves and 
measure three sets of timings: the time required just for communication, the time required for communication 
when library primitives are used (excluding the cost of schedule building) and the total time required when the 
library primitives are used (including time for schedule building). The first set of timings represents the best 
performance that hand parallelization can achieve if all the data elements to be communicated are laid out 
contiguously. If the data elements to be communicated are not contiguous, then the application programmer 
will need to do copying to aggregate message. The second set of timings represents this case. The third set 
of timings represent the performance with the use of library primitives. The timings presented are for 100 
iterations of the regular section move. 

The performance results on an iPSC/860 are presented in Figure 7. The results show that the cost 
of copying (difference between Set I and Set II) is typically a small fraction (less than 5%) of the cost of 
communication for most of the cases. Also, if the schedule built is used over a large number of iterations, 
the cost of building the schedule is also a small fraction of the cost of communication. We have performed 
similar experiments on Thinking Machines’ CM-5. Our experiments have shown that the runtime overheads 
associated with the use of primitives are again a small fraction of the bare cost of communication. 

5.2 Multiblock Code 

We have parallelized a multiblock template using our compiler. We hand parallelized this template by manually 
inserting calls to the multiblock Parti routines. (Note that this is different from the hand parallelization we 
talked about in the previous subsection.) We converted the F77 (sequential) code to F90D manually, by 
rewriting the the major computational parts of the code using forall loops and F90 array expressions, also 
adding the required data distribution directives. We then parallelized the code by running it through the 
F90D compiler. We also created a hand parallelized F90 version of the template in which all computations 
are done with single statement fovall loops, but the calls to the runtime primitives are inserted manually. 

We now compare the relative performances of compiler parallelized F90 code, hand parallelized F90 code 
and hand parallelized F77 code, varying the mesh size and number of blocks for the application, and also 
varying the number of processors used on an Intel iPSC/860. we used the minimum number of processors In 
Figure 8, we present the performance results on a 49 x 9 x 9 mesh (with one block), comparing the performance 
of the three versions from 4 to 32 processors. In Figure 9, we present the performance results on a 49 x 17 x 9 
mesh (split into two blocks), comparing the performance of the three versions from 8 to 32 processors. The 
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TWO BLOCKS: 49 X 9 X 9 Each (50 Iterations) 


Number of 
Processors 

Compiler 

Parallelized 

Hand 

Parallelized F90 

Hand 

Parallelized F77 

8 

7.49 

6.69 

6.17 

16 

4.64 

4.07 

4.03 

32 

2.88 

2.32 

2.30 


Figure 9: Performance comparison for larger mesh, two blocks (in seconds) on iPSC/860 

template is communication intensive and therefore the absolute speedups are not very high in either of the 
versions. The compiler parallelized F90 code performs within around 20% of the hand parallelized F77 code. 
The hand parallelized F90 code performs worse than the hand parallelized F77 code. This is because, in the 
F90 version, all computation is done through single statement forall loops that result in the creation of (large) 
temporary arrays. Such use of temporary storage, and the fact that no loop fusion between parallel loops 
is done by the compiler, increases the number of cache misses on each processor. However, the difference 
in performance between the F90 and F77 hand parallelized versions decreases as the number of processors 
increases. This is because as the number of processor increases, less memory is required on each processor, so 
the effect on cache utilization is less significant. The difference in performance of the hand parallelized F90 
and the compiler parallelized code comes from two major factors. First, in the compiler generated version, the 
runtime calls for computing new loop bounds are made in each loop iteration, as compared to only once for 
the hand parallelized version. Second, as the template is run over a large number of time steps, the compiler 
generated version makes repeated calls to the runtime library to build communication schedules, whereas in 
the hand parallelized version the calls are lifted out of the time step loop. To reduce the additional cost 
due to this second factor, our runtime library library saves schedules. When a call is made for generating 
the schedule, the library searches a hash table to check if any schedule with exactly the same parameters is 
present. If so, the saved schedule is returned. This technique still has this overhead as compared to a hand 
parallelized version. To study the exact costs of each of these factors, we present a more detailed experiment 
in Section 5.4. 

5.3 Multigrid Code 

We have also experimented with a semi-coarsening multigrid code developed by Rosendale and Overman [35]. 
This has nearly 2,500 lines of F77 code. We discussed the semi-coarsening multigrid technique and the mapping 
policy used in parallelizing such an application earlier in Section 2. 

We rewrote this code using forall loops and including the distribution directives and then parallelized 
it using our compiler. This code had also been parallelized by inserting the calls to the library routines 
manually [35]. In Figure 10, we show the performance comparison of these two parallel versions run on Intel 
iPSC/860. The results are for a 32x32x32 grid, using a coarsening factor of 4 along each dimension. The code 
uses 8 different grids at four different levels. We did not create a separate hand parallelized F90 version since 
most of the subroutines in this code are fairly small and therefore rewriting it in F90 using forall loops did not 
involve introducing large temporary arrays. Consequently, we did not expect to see any notable difference in 
the performance of hand parallelized F90 and F77 versions. 

The results of the performance of compiler parallelized and hand parallelized multigrid code are presented 
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Figure 10: Semi-Coarsening multigrid performance (in seconds) on iPSC/860 


TWO BLOCKS : 25 X 9 X 9 Each (50 Iterations) 


No. of 
Proc. 

Compiler 
Version I 

Compiler 
Version II 

Compiler 
Version III 

Compiler 
Version IV 

Hand 

F90 

4 

13.45 

7.63 

7.41 

7.33 

6.79 

8 

15.51 

4.78 

4.58 

4.54 

4.19 

16 

11.72 

2.85 

2.71 

2.62 

2.39 

32 

8.01 

1.85 

1.79 

1.66 

1.47 


Version I 
Version II 
Version III 
Version IV 


Runtime Library does not save schedules 
Runtime Library saves schedules 
Schedules reuse implemented by hand 
Loop bounds reused within a procedure 


Figure 11: Effects of various optimizations (in seconds) on iPSC/860 

in Figure 10. The results show that the compiler parallelized code performs within 10% of the hand parallelized 
code in this case. Again, as this code was very communication intensive, the absolute speedup is not very high 
for either version. 

5.4 Compiler Optimizations 

In Figure 11, we study the effect of the compiler optimizations. Version I is a compiler parallelized version 
in which the library does not save any schedules. This version performs badly because of the high cost of 
rebuilding the schedules for every iteration. Version II is the compiler parallelized version in which the library 
saves schedules. This results in a major gain in performance. We now discuss some optimizations which 
are not implemented in the current compiler. We studied the effect of these optimizations by modifying the 
compiler generated code by hand. Version III represents the case where the compiler performs sophisticated 
interprocedural analysis to reuse the schedules during successive time steps. Version III performs better than 
version II, in which the schedules are reused within the library, but the difference is not large. 

In the compiler parallelized version, runtime calls are made to the functions for adjusting loop bounds for 
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TWO BLOCK: 49 X 17 X 9 Mesh (50 Iterations) 
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16 

3.24 

2.83 

32 

2.41 

1.87 


Figure 12: Effect of Data Distribution on iPSC/860 


each forall loop on each time step. The hand parallelized version can store the loop bounds computed during 
the first time step, for subsequent reuse. Additionally, a procedure may contain several loops involving the 
same array on the left hand side that have the same loop bounds. Our compiler generates separate runtime 
calls for adjusting loop bounds for each such loop. Such optimizations will be implemented in a future version 
of the compiler. 

In Figure 11, the difference between version III and the hand parallelized F90 version shows the extra 
cost of generating loop bounds at runtime for each forall loop during each time step. The results show that 
generating loop bounds at runtime is the major factor in the performance difference between the compiler 
parallelized version and the hand parallelized versions. In version IV, we show the results of an unimplemented 
optimization in which the compiler is able to identify the loops with the same left hand side array and same 
loop bounds within a subroutine. Then the compiler needs to generate calls to the loop bound adjustment 
functions only once for each such set of loops. This optimization also provides an improvement over version 
III. 

5.5 Effect of Data Distributions 

As we discussed earlier, one of the features of our runtime library is the ability to map arrays (or templates) 
to subsets of the processor space. In the current definition of HPF (and hence in HPF compilers), this is 
not possible. In block structured codes, this feature allows us to keep the communication overheads low 
while maintaining the load balance. To study the benefit of this feature, we experimented with the multigrid 
template described above for two block case. We ran the parallelized code, once distributing both the blocks 
over the entire processor space and then distributing each block over disjoint processor spaces. The results on 
Intel iPSG/860, shown in the Figure 12, show that the latter scheme improves the performance by nearly 10 
to 25%. Since there is no difference in the net computation performed at each Processor in either of the the 
two cases, this difference comes because of the increased amount of communication required when each block 
is distributed across the entire processor space. Mapping a block over a large number of processors increases 
communication arising from near neighbor interactions during the regular computation within blocks. Note 
that a 10 to 25% degradation in performance occurs when there are only two blocks. We expect that with a 
larger number of blocks, the difference in the performance would be much more severe. 
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6 Conclusions 


To reliably and portably program distributed memory parallel machines , it is important to have both a 
machine independent language and runtime support for optimizing communication. High Performance Fortran 
and its variants have emerged as the most likely candidates for machine independent parallel programming on 
distributed memory machines. One class of scientific and engineering applications involves structured grids or 
meshes. These meshes may be nested (as in multigrid codes) or may be irregularly coupled (called multiblock 
codes or Irregularly Coupled Regular Meshes). Multiblock and multigrid codes form a significant part of 
scientific and engineering applications. 

In this paper we have addressed the problem of runtime, compiler and programming language support 
for parallelizing this important class of applications on distributed memory machines. We have designed and 
implemented a set of runtime primitives for parallelizing these applications in an efficient, convenient and 
machine independent manner. The runtime primitives give ability to specify data distributions, perform com- 
munication and distribute loops based on data distributions specified at runtime. One of the communication 
primitives in our library is the regular section move, which can copy a rectilinear part of a distributed array 
onto a rectilinear part of another distributed array, potentially involving index permutations, change of strides 
and change in offsets. We have presented runtime analysis which can implement this communication primitive 
efficiently. 

For making the task of application programmers easy, it is important to have compiler support. In this 
paper, we have presented techniques that can be used by compilers for HPF-style programming languages to 
automatically generate calls to the runtime primitives. We have presented the method by which the compiler 
can analyze the data access patterns associated with parallel loops and therefore identify communication 
patterns which can be efficiently handled using the communication primitives that the multiblock Parti library 
supports. We have also presented compiler transformations that the compiler performs for automatically 
generating calls to the runtime primitives. 

We have implemented the compiler analysis method in the Fortran 90D compiler being developed at 
Syracuse University. We consider this work to be a part of an integrated effort toward developing a powerful 
runtime support system for a F90D compiler. We have experimented with a template from a 3D multiblock 
Navier-Stokes solver and a multigrid code. 

For demonstrating the efficacy of our approach, we examined two separate factors: performance of the 
runtime primitives and performance of compiler parallelized code as compared to the code parallelized by 
inserting calls to the runtime primitives by hand. We examined the additional cost of using library primitives 
as compared to the minimum cost of communication. Performance results show that the additional cost in 
using the library primitives (schedule building and data copying) is a small fraction of the minimum cost 
of communication. One of the features of our library which is not supported in current version of HPF 
(and consequently in HPF compilers) is the ability to map arrays or blocks over part of the processor space. 
We presented results with TLNS3D template to show the improvement in performance achieved because 
of this feature. We compared the performance of compiler parallelized code with the performance of hand 
parallelized F90 and F77 codes, and have shown that the compiler parallelized code performs within 20% of 
hand parallelized F77 code. The optimization of having the runtime library save and reuse communication 
schedules allows the compiler parallelized code to perform almost as well as hand parallelized code. We have 
also experimented with other optimizations. The optimization of reusing computed loop bounds within a 
subroutine improves the performance of the compiler parallelized code and brings it within 10% of the hand 
parallelized version. 

To the best of our knowledge, we are not aware of any real block structured codes which have been paral- 
lelized on any distributed memory machine. The reason is that, for an application programmer parallelizing 
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such an application by hand, it is very difficult to analyze the exact communication required and then to be 
able to use the communication routines provided by the machine to communicate efficiently. Our runtime 
and compiler support can be used to parallelize such applications conveniently. Our experimental results have 
shown that the code parallelized by using the compiler will have only a small overhead as compared to the 
best hand parallelized code (i.e. parallelized by invoking system’s communication primitives by hand). 

While the design of our runtime system was motivated by multiblock and multigrid applications, our 
runtime primitives can be used in many cases for regular codes as well. We therefore, believe that our runtime 
support and compiler techniques can be used by compilers of HPF-style parallel programming languages in 
general. 
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with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our 
primitives have low runtime communication overheads. Further, the compiler parallelized codes perform within 20% 
of the code parallelized by manually inserting calls to the runtime library. 


14. SUBJECT TERMS .. J 

PARTI; tools; compilers; multiblock; multigrid; computational fluid dynamics 


15. NUMBER OF PAGES 

32 


16. PRICE CODE 

A03 


17. SECURITY CLASSIFICATION 
OF REPORT 

Unclassified 


18. SECURITY CLASSIFICATION 19. SECURITY CLASSIFICATION 
OF THIS PAGE OF ABSTRACT 

Unclassified 


20. LIMITATION 
OF ABSTRACT 


NSN 7540-01-280-5500 


Standard Form 298(Rev. 2-89) 

Prescribed by ANSI Std. Z39-18 
298-102 


☆ u.S. GOVERNMENT PRINTING OFFICE: 19*3 - S2S-M4/S6090 










